System Administration Summary and Tasks 9-6-14

AS(2/12):
Status of nodes:
Kali: Not possible to log in
Brahman: Possible to log in
slave001: There is no public IP mentioned in the sysadmin page.
slave003: Not possible to log in
slave004: Possible to log in
slave005: There is no public IP mentioned in the sysadmin page.
slave006: Not possible to log in

-- slave001, slave003, slave004, slave005 can be logged into from 'Brahman'
-- slave006 cannot be logged into from 'Brahman'
AS(24/07): Schrodinger Software and Licenses: see '2012 Schrodinger license notes.pdf'.
SysLog on Dec. 8, 2014

PL(12/8/14): It appears that the pmc-at network was down over the weekend. The network is now back to normal. It appears that some of the original DNS server entries might be outdated.

RC: The DNS for pmc-research.com is still hosted elsewhere (see below). Perhaps there was an issue at GoDaddy.
Note that our machines can also be reached at pmc-at.com.

SysLog on Dec. 1, 2014

slave004 became inaccessible again on 12/01/2014. A hard reboot was performed on 12/01/2014 at 2:00 PM. The NFS server was restarted, and the NIS client was restarted.
RC: Please check the slave004 log files. slave004 was running a cron job every 15 minutes that may have caused the problem. It has now been stopped.
The task of syncing dates/times across the various nodes has not been finished yet; this causes some inconvenience.

SysLog on Oct 29, 2014

slave004 became unresponsive on Oct 28 03:44:56, as indicated by an interruption in the log messages. The server did not respond to a soft restart, so a hard reboot was performed on Oct 29 at 15:30. The NFS server was restarted, and the NIS client was restarted.

1) Servers

Kali: custom-built 2U node; 75.150.132.106; DHCP server; kali.pmc-research.com
Brahman: custom-built 3U node; 75.150.132.108; FTP server, Torque server, NIS server, TFTP server (for kickstart installations of new nodes); pmc-research.com; brahman.pmc-research.com; alias ftp.pmc-research.com
slave001: custom-built 1U node; 2nd from top [no public IP]; IP masquerading
slave003: DL180 with RAID array; 75.150.132.105; to become fileserver (will later be renamed); slave003.pmc-research.com
slave004: DL160, 1st from top; 75.150.132.107; a dev server and master node for mpd; will also be used as a webserver when required (may be renamed); slave004.pmc-research.com; alias www.pmc-research.com
slave005: DL160, 4th from top; no remaining static public IP, connected via DHCP to the router; compute node
slave006: DL160, 3rd from top; 75.150.132.109; compute node; slave006.pmc-research.com
Ask Wei to label these machines.

XXbrahman (as FTP server, Torque server, NIS server) and slave003 (fileserver, to be renamed brahma) will both be assigned registered FQDNs on the pmc-research.com domain (see above).
DNS server: godaddy.com -> username rchakra (password to be provided) -> domains -> pmc-research.com; zone file
DNS hosting may later be transferred to us.

All the above hosts are now also registered on pmc-at.com, with the exception of ftp and www.
Thus, e.g., one can log in at brahman.pmc-at.com

Wei will look into purchasing additional storage for the fileserver RAID array.
Large files that are not routinely used should be archived with tar and, depending on size, may be stored either in /home2 on the fileserver or on another node with suitable disk space.
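
A minimal sketch of this archiving workflow (the directory and archive names below are placeholders, not actual paths on our systems):

  # create a compressed archive of an unused project directory
  tar -czvf /home2/archive/old_project_2013.tar.gz /home/rajchak/old_project_2013
  # list the archive contents to verify before removing the original
  tar -tzvf /home2/archive/old_project_2013.tar.gz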


2) NIS

(For more details and background, see Computing_Notes/Cluster/Servers_file,ftp,web.doc on Dropbox)


The NIS server is brahman. We may eventually move the NIS server to slave003, but leave it on brahman for now.

To-do:
-Edit the netgroups so that there is one for each development project
-XXDelete obsolete accounts using userdel and if necessary, groupdel (see also NFS section)
-XXRemake the NIS database and restart server

Procedure for new account creation on NIS server

1) On brahman, root should use useradd -g 500 <user_name> and then passwd <user_name> to set a temporary password. There are useradd default files that one can configure, if desired, to simplify this process.
Home directories will be made on the fileserver (slave003) but through commands on the NIS server (brahman).
[Consider which groups to assign new users to and any use of netgroups (optional for now): we will assign all new users to group 500 (rajchak).]
2) The database is to be remade on the NIS server (ypinit -m or /usr/lib64/ypinit -m, depending on the path) after each new account is added.
3) Restart the NIS server (service ypserv restart) after rebuilding the database.

4) The new user logs in and changes his/her password with yppasswd.
5) root restarts ypserv again (step 3).
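
A consolidated sketch of the steps above, run as root on brahman (the username 'newuser' is a placeholder; use /usr/lib64/ypinit -m if ypinit is not in the path, as noted above):

  # 1) create the account in group 500 and set a temporary password
  useradd -g 500 newuser
  passwd newuser
  # 2) rebuild the NIS database
  ypinit -m
  # 3) restart the NIS server
  service ypserv restart
  # 4) the new user then logs in and runs: yppasswd
  # 5) restart ypserv once more after the password change
  service ypserv restart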


Ensure that for new nodes, the /etc/nsswitch.conf file is edited (possibly with the kickstart script, see below). The critical parts of the /etc/nsswitch.conf file are the lines starting with passwd, shadow and group. These should also be set to "nis nisplus files".
Also ensure that the /etc/sysconfig/network file has the entry "NIS_DOMAIN=mypmc.nis" and /etc/yp.conf has the entry "domain mypmc.nis server brahman.pmc-research.com" (this is the case for all existing nodes).
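
For reference, the configuration lines described above would look roughly as follows (reconstructed from the description; verify against an existing node before copying):

  # /etc/nsswitch.conf
  passwd:     nis nisplus files
  shadow:     nis nisplus files
  group:      nis nisplus files

  # /etc/sysconfig/network
  NIS_DOMAIN=mypmc.nis

  # /etc/yp.conf
  domain mypmc.nis server brahman.pmc-research.com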

Sudo configuration for users:
- /etc/sudoers: entries are of the form <username> ALL = (ALL) <NOPASSWD:> <commands> (see the man page)
- The first ALL refers to the machines this rule applies to, (ALL) refers to the users one can sudo to, <NOPASSWD:> is optional and means one need not enter a password before proceeding, and <commands> is a comma-separated list of the commands allowed via sudo
- The /etc/sudoers file can be NFS-shared, but this is generally not important since users usually run sudo commands on just one or two machines; we will choose which machines
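
An illustrative /etc/sudoers entry following the format above (the username and command are placeholders; always edit the file with visudo):

  # allow user 'ashee' to restart the NIS server on any host without a password
  ashee ALL = (ALL) NOPASSWD: /sbin/service ypserv restart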

Do not make any local user accounts on other nodes.

Use userdel -r to delete a user along with home directory.


RC: Note that AS should be able to manage user accounts using yppasswd, with the exception of users in the US. He should make accounts for Anil and Subrata.


3) NFS

(For more details and background, see Computing_Notes/Cluster/Servers_file,ftp,web.doc)

-Slave003 will become the new fileserver. It is a DL180 with a RAID array.

To do:

XBack up/remove the obsolete directories prior to exporting /home from slave003. Note that the local /home directory on slave004 has several of the academic Linux cluster home directories backed up. Kali has many obsolete home directories as well. These will all be archived on slave003 (contents combined where possible).

Plan for cleaning up of obsolete home directories (to be implemented shortly):
- XAll archived directories (e.g. from slave004 and slave006) should be placed on slave003 (the fileserver) in /home2 or a similarly named archive directory
- XOther directories besides /home to be shared:
/usr/src on slave004. E.g., /usr/src/mpich2-1.0.7 contains source code for MPI examples, which needs to be accessed by the various nodes running MPI. We may later rename this node to indicate it is a dev server and master node for mpd. It will also be used as a webserver when required.
XXPL: I suggest we create a shared directory (e.g. /global) to host any shared software, common tools, etc.
ftp module directories where data is continuously written (see below)
- XXDisable or comment out any other unnecessary filesharing from /etc/fstab
- XXRetain the local rajchak home directory on all nodes, in case some local directories contain important files
- If it is difficult to determine how to combine the contents of some obsolete home directories, retain the originals in /home and fileshare over them. /home/rajchak/mpich2-install/bin on slave004 should be copied to the NFS-shared /home on slave003.
- XXThe following local home directories appear to currently exist on slave003: anisha, jmathew, akoswara, kmarimuthu. Move all of these to /home2.
AS(03/11): Done

- XXThe CVS repositories on slave004 should be moved to slave003. Start sharing /home from slave003 rather than kali. Remove the slave004 home/rajchak/cvs_repository share from slave003, since we will copy the directory to the slave003 filesystem. The CVS repos will later be replaced by git.
AS(03/11): Already done
I verified this using cat /etc/exports and cat /etc/fstab
- ssh: after filesharing of the home directories, the contents of the .ssh directories for each user are obsolete. To enable passwordless ssh (convenient for sysadmins/users who routinely log in to multiple nodes), generate key pairs on each machine for that user and copy the public keys to the other machines - done for rajchak X
AS(03/11): Need to get more information from google
RC: You should read the section on key generation in the Security and Permissions document in this folder, which discusses key generation with ssh-keygen.
You need to generate public and private keys with ssh-keygen and then add the public key to the authorized_keys file in the user's .ssh directory. As noted, since all nodes share the same home directories, this simplifies the task: there is only one .ssh directory for each user (this is not the case for root). Please see the rajchak directory for an example, and the sketch at the end of this section.
Please use rsa as the key type.
Upon testing passwordless ssh, it should update the known_hosts file.
AS(04/11): I am able to auto ssh from user ashee to rajchak.
RC: What about passwordless ssh between ashee on different machines?
AS(11/11): Passwordless ssh between ashee on different machines (e.g., brahman to slave003, slave004, slave006) is working.
I can't log in to slave003 or slave004 using my user id ashee (I could log in to those servers previously). Also, the user I created for Kaushik Bakshi (user id kbakshi, password kbakshi) can't log in to slave003, slave004, or slave005; only brahman works.
RC: It may be an issue with NIS not being updated. Please read the instructions above under NIS on the steps to be followed after new user creation, in order to ensure that the information has propagated through NIS after making the new user.
Do the same for root.
- The no_root_squash option should be passed in /etc/exports in order to give root access to directories on other machines that are sharing a directory through NFS. If the all_squash option is specified in /etc/exports, the user on the sharing machine is treated anonymously (as the "other" user); this is not needed/desirable due to NIS.
AS(03/11): Already Done
RC: Please provide estimated size of new software to be installed.
The obsolete /home2/plin directories should be removed.
AS(03/11): Done
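
A minimal sketch of the passwordless-ssh setup described above (run as the user in question; because /home is NFS-shared, the same .ssh directory is seen on all nodes, so the key only needs to be generated and authorized once; the hostname used in the test is one of our existing nodes):

  # generate an RSA key pair (accept the default location; leave the passphrase empty)
  ssh-keygen -t rsa
  # authorize the public key for logins under this same account
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  # test; this should also populate ~/.ssh/known_hosts
  ssh slave003 hostname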

4) PBS and MPI

(For more details and background, see Computing_Notes/Cluster/job_submission.doc)

Torque/PBS server is brahman.

To-do:
- Remove the file copy commands from the example job.ds script in /home/rajchak/Simulation, which are no longer required due to NFS filesharing.
- Review the submit.sh and job.ds example scripts that will be used, with modifications, to submit PBS jobs indexed by multiple parameters (paths to these scripts: ~rajchak/Simulation/; e.g., submit_mlab.sh and job_mlab.ds). Modify the scripts so data is written to the appropriate ftp-shared directories. (A generic job script sketch follows this list.)
- All scratch files (independent of user or working directory for that user) will go in ~rajchak/localscratch.
- Test ssh submission of batch server jobs (see PL and AS tasks) - preliminary testing done for rajchak
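
For reference, a generic minimal Torque/PBS job script (this is not the actual job_mlab.ds; the job name, resource requests, and command are placeholders):

  #!/bin/bash
  #PBS -N example_job
  #PBS -l nodes=1:ppn=1
  #PBS -l walltime=01:00:00
  #PBS -j oe
  # run from the directory the job was submitted from
  cd $PBS_O_WORKDIR
  # placeholder command; the real scripts run our simulation binaries
  echo "running on $(hostname)"

Such a script would be submitted with qsub, e.g. qsub example_job.ds, or remotely with ssh brahman "qsub example_job.ds" as discussed below.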


MPI server is slave004. Note: MPI configuration is not a priority at this time.
- Brahman currently has the compilers necessary for C source code compilation. Install the same compilers on slave004? For now, compile on brahman.
- XXMany of the C libraries required for compilation of our source codes are in /home/rajchak on brahman, and this directory will be moved to slave003.
- Change the mpd.hosts file in the home directory to include brahman, not brahma. Users may be asked to log in to brahman to compile until compilers are installed on slave004.
- MPI tests with two nodes (slave004, a development server, and brahman) with the example cpi code passed with two different users.
- Same as above for MPI: avoid user-specific mpd setup; give rajchak sudo permissions for the needed commands.
- mpds must be started separately by each user.

Test the Torque and MPI servers from an account other than rajchak.
For both Torque and MPI, it should be possible to execute the job submission commands remotely from any node using ssh <cmd>, e.g. ssh brahman "qsub ~rajchak/Simulation/...ds". Test this. XX


5) FTP

(For more details and background, see Computing_Notes/Cluster/Servers_file,ftp,web.doc)


FTP server is brahman. PMC-AT will use it to share files that are too big/slow to share via Dropbox and files that are primarily of interest to computational scientists. Users who do not have accounts on our Linux cluster can also use ftp to download and upload files from any computer.
AS(04/11): I can download and upload from my local machine using the ftp put and ftp get commands.
- ftp get and put: use the anonymous or ftp user login, no password
- Choose the directory where the output of cluster biochem jobs will be copied. It should be a subdirectory of /var/ftp/pub/. Ideally, mount a subdirectory of /var/ftp/pub on the fileserver directory wherein the relevant files are being written, using /etc/fstab and service netfs restart (see the sketch at the end of this section).
- /var/ftp/pub/incoming has been enabled for the ftp put command by setting flags in the /etc/vsftpd/vsftpd.conf file
AS(04/11): There is no /var/ftp/pub/incoming value in the /etc/vsftpd/vsftpd.conf file
RC: It is not a value, but rather a flag allowing uploads, as I recall. This is not a task.
- rsync -avz brahman::ftp/... <target_dir> can be used to update a local target directory without the use of a password. This could be used, e.g., with a Mac or Linux machine to sync the ftp repository.
- The specialftp module on brahman can be used to store files that should only be accessible by PMC-AT staff (the users who can access it must be specified in the rsyncd.conf file). This subdirectory can also be (renamed and) mounted on a fileserver directory.
RC: The specialftp module, subdirectory, and who has permission to use it should be discussed with PL.
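
A sketch of the kind of /etc/fstab entry the mounting idea above implies (the export path and mount point are hypothetical, and the corresponding export must exist on the fileserver for this to work):

  # /etc/fstab on brahman: mount a fileserver directory under the ftp pub tree
  slave003:/home2/biochem_output  /var/ftp/pub/biochem_output  nfs  defaults  0 0

  # then remount network filesystems
  service netfs restart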




6) Yum

(For more details and background, see Computing_Notes/Cluster/Software Installation 3.doc)


Install epel repository on all new nodes (already installed on current nodes):
rpm -Uvh http://www.nic.funet.fi/pub/mirrors/fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm installs epel for CentOS 5.
This should be part of the kickstart script. It largely installs the epel repo info in the /etc/yum.repos.d/ folder (as well as adding some documentation for the repo).

- Use yum update or yum upgrade to manually update packages as new versions become available
- It may sometimes be necessary to use yum erase to manually remove an obsolete package, followed by yum install to install the latest version
- The fastestmirror plugin often slows down updates; to speed up yum updates, the plugin can be disabled with --disableplugin=fastestmirror
- AS should look into automatic yum updates, which could be useful for maintenance of our applications, by reading the yum man pages (see the sketch after this list)
- Brahman, kali and slave001 do not have yum support due to lack of RHN. [One reason to later migrate servers to other nodes.] Consider registering these machines with Red Hat via rhn_register. SSL certificate expiration: need to reinstall the rhn_client_tools rpms to update (note dependencies)
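
One simple approach to automatic updates, assuming we are comfortable with unattended upgrades on a given node (a root cron entry; the schedule is arbitrary, and the yum-cron package, if available in our repos, would be an alternative):

  # /etc/cron.d/yum-auto-update (hypothetical file): run a quiet update nightly at 03:30
  30 3 * * * root /usr/bin/yum -y -q update >> /var/log/yum-auto-update.log 2>&1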


7) Kickstart


(For more details and background, see Computing_Notes/Cluster/Software Installation 3.doc)


- anaconda-ks.cfg in /root is the kickstart configuration file.
- It should be edited to run a customized script that will automatically configure the node per PMC-AT standards. This script needs to be developed (a rough sketch follows this list).
- In particular, use yum install to install all packages of interest that are not typically installed as part of the Linux installation.
- The tftp daemon can be turned on via chkconfig and is constantly running on brahman. The pxelinux file on brahman is also required for network installation. Set the boot method to network (PXE boot) on the client machine to be installed, and specify (on the tftp server) the MAC address of that client machine's ethernet port.
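
A rough sketch of the kind of %post section the customized kickstart script might contain (the package list is illustrative, not a finalized selection, and assumes network access during %post):

  %post
  # install the EPEL repo for CentOS 5, as described in the Yum section
  rpm -Uvh http://www.nic.funet.fi/pub/mirrors/fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
  # install extra packages not included in the default installation (illustrative list)
  yum -y install gcc gcc-c++ make
  # point NIS at brahman
  echo "NIS_DOMAIN=mypmc.nis" >> /etc/sysconfig/network
  echo "domain mypmc.nis server brahman.pmc-research.com" >> /etc/yp.conf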


8) License servers

Matlab and Schrodinger license/installation information will be copied/updated here.
Add paths to installed binaries to user profiles where appropriate

Matlab path is /usr/src/MATLAB/R2013a/bin
Matlab run from command line on some cluster nodes currently throws an error regarding the lack of a display (xterm)

RC: Regarding Matlab licenses, please provide info on the cost of additional licenses for batch jobs. Currently the licenses are configured for the root user. They may be modified to specify the app user.

PL (11/18/2014): MATLAB update:
MATLAB version 2013a has been reinstalled on slave003 and slave004 (reaching the maximum number of allowed installations) under /usr/src/MATLAB/R2013a by user app. The command can be executed directly by user "app" on slave004 and slave003. A link is created under /home/app/bin/matlab.
To run matlab in text mode, use the command "matlab -nojvm".
The installation files, including the license files for activation, are available under /home/app/src/.
I will remove outdated installation files under /usr/src.

Schrodinger 2014-3 is installed by user "app" under /usr/src/Schrodinger2014.
To run Schrodinger, run "export SCHRODINGER=/usr/src/Schrodinger2014; export PATH=$SCHRODINGER:$SCHRODINGER/utilities:$AMBERHOME/bin:$PATH".
If not running from slave003, also run "export [email protected]".
Then run the command "maestro" to start the GUI.
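
Per the note above about adding paths to user profiles, a sketch of what could be appended to a user's ~/.bash_profile (this mirrors the commands given in this section; the $AMBERHOME component of the original PATH line is omitted because $AMBERHOME is not defined in these notes):

  # MATLAB
  export PATH=/usr/src/MATLAB/R2013a/bin:$PATH
  # Schrodinger 2014-3
  export SCHRODINGER=/usr/src/Schrodinger2014
  export PATH=$SCHRODINGER:$SCHRODINGER/utilities:$PATH
  # license server, needed when not running on slave003
  export [email protected]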

9) Miscellaneous

- cron scripts making regular backups of essential directories have been written and must be modified given the new filesystem architecture
- Align date/time across all nodes. RC: This should be done shortly (see the sketch below).
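
One simple way to keep the node clocks aligned, assuming outbound NTP access (a root cron entry; pool.ntp.org is a placeholder for whatever time source we settle on):

  # /etc/cron.d/timesync (hypothetical file): resync the clock hourly
  0 * * * * root /usr/sbin/ntpdate -u pool.ntp.org >> /var/log/ntpdate.log 2>&1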

-----------------------------------------------


PMC-AT System Administration Dropbox Folder:


https://www.dropbox.com/sh/591xqz68zyyi5i1/AABDrILVMIl96pAV8-4BAXcCa?dl=0